Exploratory Data Analysis on Movie Dataset Using Python


About the Dataset

The dataset used for this analysis was scraped from IMDb, a popular online movie database, and covers films released between 1980 and 2020. With more than 7,000 movies spanning four decades, it offers a broad view of the movie industry. A sample of this size supports more reliable conclusions about the relationships between the movie indicators examined here.

Dataset source: https://www.kaggle.com/datasets/danielgrijalvas/movies

Introduction

This project focuses on exploring the relationships between various movie indicators, specifically budget, gross revenue, score, genre, and number of votes.

By examining the dataset, this project aims to uncover insights into how these movie indicators are connected and how they influence each other. This exploration will provide valuable information on the financial performance, audience reception, and critical acclaim of movies within the given time frame.

Through this project, it is possible to gain insights into how the budget allocated to a movie impacts its financial success, as measured by its gross revenue. Additionally, the analysis aims to determine the connection between the popularity of a movie, as indicated by its gross revenue, and its overall score, which reflects critical reception or audience ratings.

By exploring these relationships, the project intends to provide a deeper understanding of the dynamics within the movie industry and the factors that contribute to a movie's success or failure. The insights derived from this analysis can be valuable for movie studios, filmmakers, and industry professionals in making informed decisions regarding budgeting, marketing strategies, and overall movie production.

Preparation

In [1]:
#Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns

import re
import cpi

import matplotlib.pyplot as plt
import matplotlib

#Setting the plotting style to 'ggplot' 
plt.style.use('ggplot')

#Setting the figure size
matplotlib.rcParams['figure.figsize'] = (12,8)
In [2]:
#Reading the data
df = pd.read_csv(r'C:\Users\learu\OneDrive\Documents\Portfolio\Movies - Python\movies.csv')

#Querying the data
df
Out[2]:
name rating genre year released score votes director writer star country budget gross company runtime
0 The Shining R Drama 1980 June 13, 1980 (United States) 8.4 927000.0 Stanley Kubrick Stephen King Jack Nicholson United Kingdom 19000000.0 46998772.0 Warner Bros. 146.0
1 The Blue Lagoon R Adventure 1980 July 2, 1980 (United States) 5.8 65000.0 Randal Kleiser Henry De Vere Stacpoole Brooke Shields United States 4500000.0 58853106.0 Columbia Pictures 104.0
2 Star Wars: Episode V - The Empire Strikes Back PG Action 1980 June 20, 1980 (United States) 8.7 1200000.0 Irvin Kershner Leigh Brackett Mark Hamill United States 18000000.0 538375067.0 Lucasfilm 124.0
3 Airplane! PG Comedy 1980 July 2, 1980 (United States) 7.7 221000.0 Jim Abrahams Jim Abrahams Robert Hays United States 3500000.0 83453539.0 Paramount Pictures 88.0
4 Caddyshack R Comedy 1980 July 25, 1980 (United States) 7.3 108000.0 Harold Ramis Brian Doyle-Murray Chevy Chase United States 6000000.0 39846344.0 Orion Pictures 98.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7663 More to Life NaN Drama 2020 October 23, 2020 (United States) 3.1 18.0 Joseph Ebanks Joseph Ebanks Shannon Bond United States 7000.0 NaN NaN 90.0
7664 Dream Round NaN Comedy 2020 February 7, 2020 (United States) 4.7 36.0 Dusty Dukatz Lisa Huston Michael Saquella United States NaN NaN Cactus Blue Entertainment 90.0
7665 Saving Mbango NaN Drama 2020 April 27, 2020 (Cameroon) 5.7 29.0 Nkanya Nkwai Lynno Lovert Onyama Laura United States 58750.0 NaN Embi Productions NaN
7666 It's Just Us NaN Drama 2020 October 1, 2020 (United States) NaN NaN James Randall James Randall Christina Roz United States 15000.0 NaN NaN 120.0
7667 Tee em el NaN Horror 2020 August 19, 2020 (United States) 5.7 7.0 Pereko Mosia Pereko Mosia Siyabonga Mabaso South Africa NaN NaN PK 65 Films 102.0

7668 rows × 15 columns

Data Cleaning

In [3]:
#Checking if there are empty cells
for col in df.columns:
    missing = np.mean(df[col].isnull())
    print('{}: {}%'.format(col, round(missing*100)))
name: 0%
rating: 1%
genre: 0%
year: 0%
released: 0%
score: 0%
votes: 0%
director: 0%
writer: 0%
star: 0%
country: 0%
budget: 28%
gross: 2%
company: 0%
runtime: 0%
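
The same per-column check can be written without an explicit loop. The following is a minimal sketch, not part of the original notebook: `missing_percent` is a hypothetical helper name, and the small `demo` frame is made-up data used only for illustration.

```python
import numpy as np
import pandas as pd


def missing_percent(frame):
    """Return the percentage of missing values per column, rounded to whole percent."""
    # isnull() gives a boolean frame; the mean of booleans is the fraction of True values
    return (frame.isnull().mean() * 100).round().astype(int)


# Made-up data for illustration only
demo = pd.DataFrame({
    "budget": [1.0, np.nan, 3.0, np.nan],
    "gross": [1.0, 2.0, 3.0, 4.0],
})
print(missing_percent(demo))
```

Applied to the notebook's data, the call would be `missing_percent(df)`.
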
In [4]:
# Dropping rows with no values in 'budget' or 'gross' columns as it will affect the analysis
df.dropna(subset=['budget', 'gross'], inplace=True)
df
Out[4]:
name rating genre year released score votes director writer star country budget gross company runtime
0 The Shining R Drama 1980 June 13, 1980 (United States) 8.4 927000.0 Stanley Kubrick Stephen King Jack Nicholson United Kingdom 19000000.0 46998772.0 Warner Bros. 146.0
1 The Blue Lagoon R Adventure 1980 July 2, 1980 (United States) 5.8 65000.0 Randal Kleiser Henry De Vere Stacpoole Brooke Shields United States 4500000.0 58853106.0 Columbia Pictures 104.0
2 Star Wars: Episode V - The Empire Strikes Back PG Action 1980 June 20, 1980 (United States) 8.7 1200000.0 Irvin Kershner Leigh Brackett Mark Hamill United States 18000000.0 538375067.0 Lucasfilm 124.0
3 Airplane! PG Comedy 1980 July 2, 1980 (United States) 7.7 221000.0 Jim Abrahams Jim Abrahams Robert Hays United States 3500000.0 83453539.0 Paramount Pictures 88.0
4 Caddyshack R Comedy 1980 July 25, 1980 (United States) 7.3 108000.0 Harold Ramis Brian Doyle-Murray Chevy Chase United States 6000000.0 39846344.0 Orion Pictures 98.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7648 Bad Boys for Life R Action 2020 January 17, 2020 (United States) 6.6 140000.0 Adil El Arbi Peter Craig Will Smith United States 90000000.0 426505244.0 Columbia Pictures 124.0
7649 Sonic the Hedgehog PG Action 2020 February 14, 2020 (United States) 6.5 102000.0 Jeff Fowler Pat Casey Ben Schwartz United States 85000000.0 319715683.0 Paramount Pictures 99.0
7650 Dolittle PG Adventure 2020 January 17, 2020 (United States) 5.6 53000.0 Stephen Gaghan Stephen Gaghan Robert Downey Jr. United States 175000000.0 245487753.0 Universal Pictures 101.0
7651 The Call of the Wild PG Adventure 2020 February 21, 2020 (United States) 6.8 42000.0 Chris Sanders Michael Green Harrison Ford Canada 135000000.0 111105497.0 20th Century Studios 100.0
7652 The Eight Hundred Not Rated Action 2020 August 28, 2020 (United States) 6.8 3700.0 Hu Guan Hu Guan Zhi-zhong Huang China 80000000.0 461421559.0 Beijing Diqi Yinxiang Entertainment 149.0

5436 rows × 15 columns

In [5]:
#Checking again if there are empty cells
for col in df.columns:
    missing = np.mean(df[col].isnull())
    print('{}: {}%'.format(col, round(missing*100)))
name: 0%
rating: 0%
genre: 0%
year: 0%
released: 0%
score: 0%
votes: 0%
director: 0%
writer: 0%
star: 0%
country: 0%
budget: 0%
gross: 0%
company: 0%
runtime: 0%
In [6]:
#Extracting the year from the released column

def extract_releaseyear(data):
    #A single pattern is enough: any "Month day, year" string already contains a 4-digit year
    match = re.search(r'\b\d{4}\b', data)
    #Returning the first 4-digit number found, or None if there is none
    return int(match.group()) if match else None

df['year_released'] = df['released'].astype(str).apply(extract_releaseyear)
df
Out[6]:
name rating genre year released score votes director writer star country budget gross company runtime year_released
0 The Shining R Drama 1980 June 13, 1980 (United States) 8.4 927000.0 Stanley Kubrick Stephen King Jack Nicholson United Kingdom 19000000.0 46998772.0 Warner Bros. 146.0 1980
1 The Blue Lagoon R Adventure 1980 July 2, 1980 (United States) 5.8 65000.0 Randal Kleiser Henry De Vere Stacpoole Brooke Shields United States 4500000.0 58853106.0 Columbia Pictures 104.0 1980
2 Star Wars: Episode V - The Empire Strikes Back PG Action 1980 June 20, 1980 (United States) 8.7 1200000.0 Irvin Kershner Leigh Brackett Mark Hamill United States 18000000.0 538375067.0 Lucasfilm 124.0 1980
3 Airplane! PG Comedy 1980 July 2, 1980 (United States) 7.7 221000.0 Jim Abrahams Jim Abrahams Robert Hays United States 3500000.0 83453539.0 Paramount Pictures 88.0 1980
4 Caddyshack R Comedy 1980 July 25, 1980 (United States) 7.3 108000.0 Harold Ramis Brian Doyle-Murray Chevy Chase United States 6000000.0 39846344.0 Orion Pictures 98.0 1980
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7648 Bad Boys for Life R Action 2020 January 17, 2020 (United States) 6.6 140000.0 Adil El Arbi Peter Craig Will Smith United States 90000000.0 426505244.0 Columbia Pictures 124.0 2020
7649 Sonic the Hedgehog PG Action 2020 February 14, 2020 (United States) 6.5 102000.0 Jeff Fowler Pat Casey Ben Schwartz United States 85000000.0 319715683.0 Paramount Pictures 99.0 2020
7650 Dolittle PG Adventure 2020 January 17, 2020 (United States) 5.6 53000.0 Stephen Gaghan Stephen Gaghan Robert Downey Jr. United States 175000000.0 245487753.0 Universal Pictures 101.0 2020
7651 The Call of the Wild PG Adventure 2020 February 21, 2020 (United States) 6.8 42000.0 Chris Sanders Michael Green Harrison Ford Canada 135000000.0 111105497.0 20th Century Studios 100.0 2020
7652 The Eight Hundred Not Rated Action 2020 August 28, 2020 (United States) 6.8 3700.0 Hu Guan Hu Guan Zhi-zhong Huang China 80000000.0 461421559.0 Beijing Diqi Yinxiang Entertainment 149.0 2020

5436 rows × 16 columns
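
As an alternative to applying a Python function row by row, the year can be pulled out with pandas' vectorized string methods. This is a sketch under assumptions rather than the notebook's method: `extract_release_year` is a hypothetical name, and the `demo` strings are illustrative (only the first one appears in the dataset preview above).

```python
import pandas as pd


def extract_release_year(released):
    """Extract the first 4-digit number from each string as a nullable integer."""
    # expand=False returns a Series because the pattern has a single capture group
    years = released.astype(str).str.extract(r"\b(\d{4})\b", expand=False)
    # Int64 (capital I) is pandas' nullable integer type, so rows without a year become <NA>
    return pd.to_numeric(years).astype("Int64")


# Illustrative inputs; the last one has no year on purpose
demo = pd.Series(["June 13, 1980 (United States)", "1985 (Japan)", "unknown"])
demo_years = extract_release_year(demo)
print(demo_years)
```

With the notebook's frame this would read `df['year_released'] = extract_release_year(df['released'])`.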

In [7]:
#Querying data types of each column
df.dtypes
Out[7]:
name              object
rating            object
genre             object
year               int64
released          object
score            float64
votes            float64
director          object
writer            object
star              object
country           object
budget           float64
gross            float64
company           object
runtime          float64
year_released      int64
dtype: object
In [8]:
#Changing the data type of some columns for readability
df['votes'] = df['votes'].astype('int64')
df['budget'] = df['budget'].astype('int64')
df['gross'] = df['gross'].astype('int64')
df['year_released'] = df['year_released'].astype('int64')
df
Out[8]:
name rating genre year released score votes director writer star country budget gross company runtime year_released
0 The Shining R Drama 1980 June 13, 1980 (United States) 8.4 927000 Stanley Kubrick Stephen King Jack Nicholson United Kingdom 19000000 46998772 Warner Bros. 146.0 1980
1 The Blue Lagoon R Adventure 1980 July 2, 1980 (United States) 5.8 65000 Randal Kleiser Henry De Vere Stacpoole Brooke Shields United States 4500000 58853106 Columbia Pictures 104.0 1980
2 Star Wars: Episode V - The Empire Strikes Back PG Action 1980 June 20, 1980 (United States) 8.7 1200000 Irvin Kershner Leigh Brackett Mark Hamill United States 18000000 538375067 Lucasfilm 124.0 1980
3 Airplane! PG Comedy 1980 July 2, 1980 (United States) 7.7 221000 Jim Abrahams Jim Abrahams Robert Hays United States 3500000 83453539 Paramount Pictures 88.0 1980
4 Caddyshack R Comedy 1980 July 25, 1980 (United States) 7.3 108000 Harold Ramis Brian Doyle-Murray Chevy Chase United States 6000000 39846344 Orion Pictures 98.0 1980
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7648 Bad Boys for Life R Action 2020 January 17, 2020 (United States) 6.6 140000 Adil El Arbi Peter Craig Will Smith United States 90000000 426505244 Columbia Pictures 124.0 2020
7649 Sonic the Hedgehog PG Action 2020 February 14, 2020 (United States) 6.5 102000 Jeff Fowler Pat Casey Ben Schwartz United States 85000000 319715683 Paramount Pictures 99.0 2020
7650 Dolittle PG Adventure 2020 January 17, 2020 (United States) 5.6 53000 Stephen Gaghan Stephen Gaghan Robert Downey Jr. United States 175000000 245487753 Universal Pictures 101.0 2020
7651 The Call of the Wild PG Adventure 2020 February 21, 2020 (United States) 6.8 42000 Chris Sanders Michael Green Harrison Ford Canada 135000000 111105497 20th Century Studios 100.0 2020
7652 The Eight Hundred Not Rated Action 2020 August 28, 2020 (United States) 6.8 3700 Hu Guan Hu Guan Zhi-zhong Huang China 80000000 461421559 Beijing Diqi Yinxiang Entertainment 149.0 2020

5436 rows × 16 columns

In [9]:
#Adjusting the budget and gross columns for inflation using the cpi library
#(by default, cpi.inflate converts to the dollars of the most recent year it has data for)

def inflation_adjust(data, column):
    return data.apply(lambda x: cpi.inflate(x[column], x.year_released), axis=1)

df['budget_inflation_adjust'] = inflation_adjust(df, 'budget')
df['gross_inflation_adjust'] = inflation_adjust(df, 'gross')
df
Out[9]:
name rating genre year released score votes director writer star country budget gross company runtime year_released budget_inflation_adjust gross_inflation_adjust
0 The Shining R Drama 1980 June 13, 1980 (United States) 8.4 927000 Stanley Kubrick Stephen King Jack Nicholson United Kingdom 19000000 46998772 Warner Bros. 146.0 1980 6.748113e+07 1.669226e+08
1 The Blue Lagoon R Adventure 1980 July 2, 1980 (United States) 5.8 65000 Randal Kleiser Henry De Vere Stacpoole Brooke Shields United States 4500000 58853106 Columbia Pictures 104.0 1980 1.598237e+07 2.090249e+08
2 Star Wars: Episode V - The Empire Strikes Back PG Action 1980 June 20, 1980 (United States) 8.7 1200000 Irvin Kershner Leigh Brackett Mark Hamill United States 18000000 538375067 Lucasfilm 124.0 1980 6.392949e+07 1.912114e+09
3 Airplane! PG Comedy 1980 July 2, 1980 (United States) 7.7 221000 Jim Abrahams Jim Abrahams Robert Hays United States 3500000 83453539 Paramount Pictures 88.0 1980 1.243073e+07 2.963968e+08
4 Caddyshack R Comedy 1980 July 25, 1980 (United States) 7.3 108000 Harold Ramis Brian Doyle-Murray Chevy Chase United States 6000000 39846344 Orion Pictures 98.0 1980 2.130983e+07 1.415198e+08
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7648 Bad Boys for Life R Action 2020 January 17, 2020 (United States) 6.6 140000 Adil El Arbi Peter Craig Will Smith United States 90000000 426505244 Columbia Pictures 124.0 2020 1.017691e+08 4.822782e+08
7649 Sonic the Hedgehog PG Action 2020 February 14, 2020 (United States) 6.5 102000 Jeff Fowler Pat Casey Ben Schwartz United States 85000000 319715683 Paramount Pictures 99.0 2020 9.611522e+07 3.615240e+08
7650 Dolittle PG Adventure 2020 January 17, 2020 (United States) 5.6 53000 Stephen Gaghan Stephen Gaghan Robert Downey Jr. United States 175000000 245487753 Universal Pictures 101.0 2020 1.978843e+08 2.775895e+08
7651 The Call of the Wild PG Adventure 2020 February 21, 2020 (United States) 6.8 42000 Chris Sanders Michael Green Harrison Ford Canada 135000000 111105497 20th Century Studios 100.0 2020 1.526536e+08 1.256345e+08
7652 The Eight Hundred Not Rated Action 2020 August 28, 2020 (United States) 6.8 3700 Hu Guan Hu Guan Zhi-zhong Huang China 80000000 461421559 Beijing Diqi Yinxiang Entertainment 149.0 2020 9.046138e+07 5.217604e+08

5436 rows × 18 columns
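
Calling the inflation function once per row repeats the same lookup for every movie released in the same year. A possible shortcut, sketched below, computes one multiplier per distinct year and reuses it. It assumes the adjustment scales linearly with the value, which is how a CPI ratio works. `inflation_adjust_by_year` and `toy_inflate` are hypothetical names, and the multipliers inside `toy_inflate` are made up so that the example runs without the cpi library.

```python
import pandas as pd


def inflation_adjust_by_year(data, column, year_column, inflate):
    """Adjust a money column using one inflation multiplier per distinct year."""
    # inflate(1.0, year) is the value of one unit of currency from that year after adjustment
    factors = {int(year): inflate(1.0, int(year)) for year in data[year_column].unique()}
    # Map each row's year to its multiplier and scale the column
    return data[column] * data[year_column].map(factors)


def toy_inflate(value, year):
    # Made-up multipliers for illustration, NOT real CPI figures
    multipliers = {1980: 3.0, 2000: 1.5, 2020: 1.0}
    return value * multipliers[year]


# Made-up data for illustration only
demo = pd.DataFrame({
    "budget": [10, 20, 30, 40],
    "year_released": [1980, 1980, 2000, 2020],
})
demo["budget_adjusted"] = inflation_adjust_by_year(demo, "budget", "year_released", toy_inflate)
print(demo)
```

In the notebook, the last argument would be `cpi.inflate` instead of `toy_inflate`, under the linearity assumption stated above.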

In [10]:
#Showing all the digits in the inflation adjusted columns for readability
df['budget_inflation_adjust'] = df['budget_inflation_adjust'].astype('int64')
df['gross_inflation_adjust'] = df['gross_inflation_adjust'].astype('int64')
df
Out[10]:
name rating genre year released score votes director writer star country budget gross company runtime year_released budget_inflation_adjust gross_inflation_adjust
0 The Shining R Drama 1980 June 13, 1980 (United States) 8.4 927000 Stanley Kubrick Stephen King Jack Nicholson United Kingdom 19000000 46998772 Warner Bros. 146.0 1980 67481128 166922641
1 The Blue Lagoon R Adventure 1980 July 2, 1980 (United States) 5.8 65000 Randal Kleiser Henry De Vere Stacpoole Brooke Shields United States 4500000 58853106 Columbia Pictures 104.0 1980 15982372 209024948
2 Star Wars: Episode V - The Empire Strikes Back PG Action 1980 June 20, 1980 (United States) 8.7 1200000 Irvin Kershner Leigh Brackett Mark Hamill United States 18000000 538375067 Lucasfilm 124.0 1980 63929490 1912113534
3 Airplane! PG Comedy 1980 July 2, 1980 (United States) 7.7 221000 Jim Abrahams Jim Abrahams Robert Hays United States 3500000 83453539 Paramount Pictures 88.0 1980 12430734 296396789
4 Caddyshack R Comedy 1980 July 25, 1980 (United States) 7.3 108000 Harold Ramis Brian Doyle-Murray Chevy Chase United States 6000000 39846344 Orion Pictures 98.0 1980 21309830 141519803
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7648 Bad Boys for Life R Action 2020 January 17, 2020 (United States) 6.6 140000 Adil El Arbi Peter Craig Will Smith United States 90000000 426505244 Columbia Pictures 124.0 2020 101769051 482278157
7649 Sonic the Hedgehog PG Action 2020 February 14, 2020 (United States) 6.5 102000 Jeff Fowler Pat Casey Ben Schwartz United States 85000000 319715683 Paramount Pictures 99.0 2020 96115215 361524020
7650 Dolittle PG Adventure 2020 January 17, 2020 (United States) 5.6 53000 Stephen Gaghan Stephen Gaghan Robert Downey Jr. United States 175000000 245487753 Universal Pictures 101.0 2020 197884266 277589508
7651 The Call of the Wild PG Adventure 2020 February 21, 2020 (United States) 6.8 42000 Chris Sanders Michael Green Harrison Ford Canada 135000000 111105497 20th Century Studios 100.0 2020 152653577 125634456
7652 The Eight Hundred Not Rated Action 2020 August 28, 2020 (United States) 6.8 3700 Hu Guan Hu Guan Zhi-zhong Huang China 80000000 461421559 Beijing Diqi Yinxiang Entertainment 149.0 2020 90461379 521760382

5436 rows × 18 columns

In [11]:
#Deleting duplicate rows if any. In this case, there are no duplicate rows.
#drop_duplicates returns a new DataFrame, so the result has to be assigned back
df = df.drop_duplicates()

#Sorting from highest to lowest gross
df_sorted = df.sort_values(by=['gross_inflation_adjust'], inplace=False, ascending=False)
df_sorted
Out[11]:
name rating genre year released score votes director writer star country budget gross company runtime year_released budget_inflation_adjust gross_inflation_adjust
3045 Titanic PG-13 Drama 1997 December 19, 1997 (United States) 7.8 1100000 James Cameron James Cameron Leonardo DiCaprio United States 200000000 2201647264 Twentieth Century Fox 194.0 1997 364679127 4014474018
5445 Avatar PG-13 Action 2009 December 18, 2009 (United States) 7.8 1100000 James Cameron James Cameron Sam Worthington United States 237000000 2847246203 Twentieth Century Fox 162.0 2009 323297310 3883995942
7445 Avengers: Endgame PG-13 Action 2019 April 26, 2019 (United States) 8.4 903000 Anthony Russo Christopher Markus Robert Downey Jr. United States 356000000 2797501328 Marvel Studios 181.0 2019 407519371 3202348267
6663 Star Wars: Episode VII - The Force Awakens PG-13 Action 2015 December 18, 2015 (United States) 7.8 876000 J.J. Abrams Lawrence Kasdan Daisy Ridley United States 245000000 2069521700 Lucasfilm 138.0 2015 302511950 2555326719
209 E.T. the Extra-Terrestrial PG Family 1982 June 11, 1982 (United States) 7.8 381000 Steven Spielberg Melissa Mathison Henry Thomas United States 10500000 792910554 Universal Pictures 115.0 1982 31843290 2404655317
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5640 Tanner Hall R Drama 2009 January 15, 2015 (Sweden) 5.8 3500 Francesca Gregorini Tatiana von Fürstenberg Rooney Mara United States 3000000 5073 Two Prong Lesson 96.0 2015 3704227 6263
2434 Philadelphia Experiment II PG-13 Action 1993 June 4, 1994 (South Korea) 4.5 1900 Stephen Cornwell Wallace C. Bennett Brad Johnson United States 5000000 2970 Trimark Pictures 97.0 1994 9873650 5864
3681 Ginger Snaps Not Rated Drama 2000 May 11, 2001 (Canada) 6.8 43000 John Fawcett Karen Walton Emily Perkins Canada 5000000 2554 Copperheart Entertainment 108.0 2001 8262422 4220
2417 Madadayo NaN Drama 1993 April 17, 1993 (Japan) 7.3 5100 Akira Kurosawa Ishirô Honda Tatsuo Matsumura Japan 11900000 596 DENTSU Music And Entertainment 134.0 1993 24100999 1207
3203 Trojan War PG-13 Comedy 1997 October 1, 1997 (Brazil) 5.7 5800 George Huang Andy Burg Will Friedle United States 15000000 309 Daybreak 85.0 1997 27350934 563

5436 rows × 18 columns

Data Analysis

In [12]:
#Correlation Matrix between selected numeric columns
selected_numeric = ['year_released','score','votes','budget_inflation_adjust','gross_inflation_adjust','runtime']
correlation_matrix = df[selected_numeric].corr(method='pearson')
correlation_matrix
Out[12]:
                         year_released     score     votes  budget_inflation_adjust  gross_inflation_adjust   runtime
year_released                 1.000000  0.061029  0.202883                 0.159099                0.159096  0.074432
score                         0.061029  1.000000  0.473809                 0.059347                0.242249  0.414580
votes                         0.202883  0.473809  1.000000                 0.421948                0.635112  0.352437
budget_inflation_adjust       0.159099  0.059347  0.421948                 1.000000                0.677118  0.339182
gross_inflation_adjust        0.159096  0.242249  0.635112                 0.677118                1.000000  0.282979
runtime                       0.074432  0.414580  0.352437                 0.339182                0.282979  1.000000
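
Reading the strongest relationships off a matrix by eye is error-prone, so the pairs can also be ranked programmatically. The sketch below keeps only the upper triangle (each pair once, no diagonal) and sorts it. `rank_correlation_pairs` is a hypothetical helper, and `corr_subset` re-enters four of the columns from the output above by hand so that the example is self-contained.

```python
import numpy as np
import pandas as pd


def rank_correlation_pairs(corr):
    """Return each distinct pair of variables sorted by correlation, highest first."""
    # k=1 keeps the cells strictly above the diagonal, so every pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # Masked-out cells become NaN; stack() turns the matrix into a (row, column) indexed Series
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False)


# Values copied from the correlation matrix above (four of the six columns)
labels = ["score", "votes", "budget_inflation_adjust", "gross_inflation_adjust"]
values = [
    [1.000000, 0.473809, 0.059347, 0.242249],
    [0.473809, 1.000000, 0.421948, 0.635112],
    [0.059347, 0.421948, 1.000000, 0.677118],
    [0.242249, 0.635112, 0.677118, 1.000000],
]
corr_subset = pd.DataFrame(values, index=labels, columns=labels)
ranked = rank_correlation_pairs(corr_subset)
print(ranked.head(3))
```

In the notebook, `rank_correlation_pairs(correlation_matrix)` would rank all fifteen pairs at once.
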
In [13]:
#Creating a heatmap to visualize the correlation matrix
sns.heatmap(correlation_matrix, annot = True, cmap="mako")
plt.title("Correlation matrix of Numeric Features")
plt.show()
In [14]:
#Highest correlation is between gross and budget
#Plotting a scatter plot with a linear regression line using seaborn
sns.regplot(x="budget_inflation_adjust", y="gross_inflation_adjust", data=df, scatter_kws={"color":"SteelBlue"}, line_kws={"color":"DarkSeaGreen"})

#Calculating the correlation coefficient and rounding to 6 decimal places
corr_coeff1 = round(df['budget_inflation_adjust'].corr(df['gross_inflation_adjust']), 6)

#Creating the annotation text
text = f'Correlation Coefficient: {corr_coeff1}'

#Adding the annotation to the plot
plt.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction')

#Adding labels
plt.title('Gross vs. Budget')
plt.xlabel("Budget (inflation adjusted)")
plt.ylabel("Gross (inflation adjusted)")

#Displaying the plot
plt.show()

The correlation coefficient of 0.677118 suggests a moderate to strong positive correlation between a movie's gross revenue and its budget. This correlation implies that, on average, movies with higher budgets have a higher likelihood of generating higher gross revenue.

It is important to note that correlation does not necessarily imply causation. While a higher budget may contribute to a movie's success and marketing efforts, there are other factors at play that can influence a film's gross revenue, such as the quality of the script, direction, acting, marketing strategy, competition, release timing, and audience reception.

However, a positive correlation between gross and budget suggests that investing more resources into a movie's production and marketing may increase the chances of generating higher revenue. Larger budgets often allow for better production values, elaborate visual effects, renowned actors, and extensive marketing campaigns, all of which can attract a wider audience and potentially result in higher ticket sales and other revenue streams such as merchandise and licensing.

It is important to consider that this correlation may not hold true for every movie. There will always be exceptions where movies with lower budgets perform exceptionally well at the box office, and vice versa. Additionally, other factors like genre, target audience, critical reception, and competition within the industry can significantly impact a movie's financial success.

In [15]:
#2nd highest correlation is between number of votes and gross
#Plotting a scatter plot with a linear regression line using seaborn
sns.regplot(x="gross_inflation_adjust", y="votes", data=df, scatter_kws={"color":"SteelBlue"}, line_kws={"color":"DarkSeaGreen"})

#Calculating the correlation coefficient and rounding to 6 decimal places
corr_coeff2 = round(df['gross_inflation_adjust'].corr(df['votes']), 6)

#Creating the annotation text
text = f'Correlation Coefficient: {corr_coeff2}'

#Adding the annotation to the plot
plt.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction')

#Adding labels
plt.title('No. of Votes vs. Gross')
plt.xlabel("Gross (inflation adjusted)")
plt.ylabel("No. of Votes")

#Displaying the plot
plt.show()

The correlation coefficient of 0.635112 between the gross revenue of a movie and the number of votes it receives suggests a moderate positive relationship between these two variables. This means that there is a tendency for movies with higher gross revenue to also receive a larger number of votes.

The positive correlation implies that as the gross revenue of a movie increases, there is a higher likelihood for the movie to also accumulate more votes. This suggests that movies that perform well financially, in terms of higher ticket sales, streaming revenue, DVD sales, and other revenue streams, tend to attract a larger audience and generate more votes.

In other words, a movie's financial success and its level of audience engagement, as reflected by the number of votes, move together to a meaningful degree, though far from perfectly.

However, it's important to note that correlation does not imply causation. While the correlation suggests that movies with higher gross revenue tend to receive more votes, other factors can also influence this relationship. For example, factors such as marketing efforts, genre appeal, critical reception, release timing, and overall quality of the film can all impact both the gross revenue and the number of votes a movie accumulates.

In [16]:
#3rd highest correlation is between number of votes and score
#Plotting a scatter plot with a linear regression line using seaborn
sns.regplot(x='votes', y='score', data=df, scatter_kws={"color":"SteelBlue"}, line_kws={"color":"DarkSeaGreen"})

#Calculating the correlation coefficient and rounding to 6 decimal places
corr_coeff3 = round(df['votes'].corr(df['score']), 6)

#Creating the annotation text
text = f'Correlation Coefficient: {corr_coeff3}'

#Adding the annotation to the plot
plt.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction')

#Adding labels
plt.title('Score vs. No. of Votes')
plt.xlabel('No. of Votes')
plt.ylabel('Score')

#Displaying the plot
plt.show()

The correlation between score and number of votes (0.473809) indicates a weak to moderate positive relationship: movies that receive more votes tend to have a higher score.

However, a coefficient of 0.473809 means that the number of votes is not a strong predictor of the score, or vice versa. A movie can receive a large number of votes and still have a low score, and a movie with few votes can score highly. Other factors that may influence a movie's score, such as genre, plot, acting, and direction, should therefore be considered as well.
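
One way to put the three coefficients discussed so far in perspective is the coefficient of determination. For a straight-line fit with a single predictor, the square of the Pearson coefficient is the share of the variance in one variable that the other accounts for. The short sketch below applies that arithmetic to the values reported above; the dictionary names are illustrative.

```python
# Pearson coefficients taken from the correlation matrix above
coefficients = {
    "budget vs. gross": 0.677118,
    "gross vs. votes": 0.635112,
    "votes vs. score": 0.473809,
}

# Squaring r gives the share of variance explained by a simple linear fit
r_squared = {pair: r ** 2 for pair, r in coefficients.items()}

for pair, value in r_squared.items():
    print(f"{pair}: r = {coefficients[pair]:.3f}, r^2 = {value:.3f}")
```

Seen this way, budget accounts for roughly 46% of the variation in gross, gross for roughly 40% of the variation in votes, and votes for roughly 22% of the variation in score, which leaves most of the variation in each case to other factors.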

In [17]:
#Summarizing by Genre
bygenre = df.groupby('genre').agg({
    'name': 'count',
    'score': 'mean',
    'budget_inflation_adjust': 'mean',
    'gross_inflation_adjust': 'mean'
})

#Showing all the digits in the inflation adjusted columns for readability
bygenre['budget_inflation_adjust'] = bygenre['budget_inflation_adjust'].astype('int64')
bygenre['gross_inflation_adjust'] = bygenre['gross_inflation_adjust'].astype('int64')

#Renaming column names
bygenre = bygenre.rename(columns={'name': 'movie count', 'score': 'mean score', 'budget_inflation_adjust': 'mean budget', 'gross_inflation_adjust': 'mean gross'})

#Sorting by mean score
bygenre_sorted = bygenre.sort_values('mean score',ascending=False)

#Displaying sorted data
bygenre_sorted
Out[17]:
           movie count  mean score  mean budget  mean gross
genre
Biography          312    7.084936     38713742    90650204
Drama              869    6.723590     38210333    97235517
Animation          278    6.695683    106559080   386520569
Crime              400    6.690250     38180888    80763881
Family               4    6.675000     71147707   985405770
Mystery             17    6.670588     51723053   185561825
Romance              5    6.580000     40070013    52899276
Sci-Fi               6    6.350000     43117574    55340360
Adventure          327    6.268196     71892995   200162330
Action            1417    6.247212     87010877   246321671
Comedy            1496    6.190709     38035571    97351092
Fantasy             42    6.004762     31609345    70088598
Western              2    5.950000     26806289    21153639
Thriller             7    5.928571     23029084    64544680
Horror             254    5.825197     21679940    83787647

Despite having mean budget and mean gross values in the middle range compared to other genres, the Biography genre stands out with the highest mean score. This suggests that a movie's popularity, as reflected by its gross, does not necessarily correlate with its score, which is consistent with the weak correlation coefficient of 0.242249 between score and gross in the correlation matrix above.

On the other hand, the Family genre showcases the highest mean gross, which can be attributed to its broader target audience compared to other genres. However, it is important to note that this finding is based on the analysis of only four Family movies included in the dataset, indicating the need for further studies to validate this observation.

Furthermore, Animation movies have the highest mean budget and the second highest mean gross. This suggests that the genre invests significantly in production costs, which potentially contributes to its financial success.
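
Because means computed from a handful of movies are unstable, one option is to leave out genres below a minimum movie count before comparing them. The sketch below shows the idea with named aggregation. `summarize_by_genre` is a hypothetical helper, the default threshold of 30 is an arbitrary assumption rather than a recommendation, and the `demo` rows are made up.

```python
import pandas as pd


def summarize_by_genre(frame, min_movies=30):
    """Aggregate per genre and keep only genres with at least min_movies entries."""
    summary = frame.groupby("genre").agg(
        movie_count=("name", "count"),
        mean_score=("score", "mean"),
        mean_gross=("gross_inflation_adjust", "mean"),
    )
    # Dropping small genres avoids ranking them on the strength of a few movies
    kept = summary[summary["movie_count"] >= min_movies]
    return kept.sort_values("mean_score", ascending=False)


# Made-up rows for illustration only
demo = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "genre": ["Family", "Drama", "Drama", "Drama"],
    "score": [7.0, 6.0, 7.0, 8.0],
    "gross_inflation_adjust": [900, 100, 200, 300],
})
demo_summary = summarize_by_genre(demo, min_movies=2)
print(demo_summary)
```

With the threshold in place, the single made-up Family movie no longer tops the table on the basis of one data point.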

In [18]:
#Plotting the number of movies by genre with a score of at least 8

#Filtering movies with a score of at least 8
highscore_df = df[df['score'] >= 8]

#Calculating the movie count by genre
genre_counts = highscore_df['genre'].value_counts()

#Generating a purple to blue gradient colormap
cmap = plt.colormaps['PuBu']

#Normalizing the data for mapping to colormap
norm = plt.Normalize(np.min(genre_counts.values), np.max(genre_counts.values))

#Creating a bar graph
plt.bar(genre_counts.index, genre_counts.values, color=cmap(norm(genre_counts.values)))

#Setting the labels and title
plt.xlabel('Genre')
plt.ylabel('Movie Count')
plt.title('Movie Count by Genre with Score >= 8')

#Displaying the bar graph
plt.show()

In contrast to its ranking in terms of mean gross in the previous table, the Drama genre has the largest number of highly rated movies. This further supports the previous observation that a movie's popularity, as reflected by its gross, does not necessarily correlate with its score.

Additionally, despite having the highest movie count in this dataset, the Comedy genre has relatively few movies with a high score. Comedy movies may be abundant, but only a few of them attain a high score. One possible explanation, which this analysis cannot confirm, is that the Comedy genre focuses primarily on entertainment value rather than on profound storytelling or cinematic excellence.
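
Raw counts favor genres that simply contain more movies, so it can help to look at the share of each genre's movies that reach the threshold as well. The following sketch computes both. `high_score_share` is a hypothetical helper and the `demo` rows are made up.

```python
import pandas as pd


def high_score_share(frame, threshold=8.0):
    """Count high-scoring movies per genre and express them as a share of the genre."""
    is_high = frame["score"] >= threshold
    # Summing booleans counts the True values; count gives the genre size
    result = is_high.groupby(frame["genre"]).agg(high_count="sum", total="count")
    result["share"] = result["high_count"] / result["total"]
    return result.sort_values("share", ascending=False)


# Made-up rows for illustration only
demo = pd.DataFrame({
    "genre": ["Drama", "Drama", "Comedy", "Comedy", "Comedy", "Comedy"],
    "score": [8.5, 6.0, 8.1, 5.0, 6.0, 7.0],
})
demo_share = high_score_share(demo)
print(demo_share)
```

In the notebook, `high_score_share(df)` would show whether Drama still leads once genre size is taken into account.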

Summary

In summary, there is a moderate to strong positive correlation between a movie's budget and its gross revenue: on average, movies with higher budgets tend to generate higher gross revenue. There is also a moderate positive correlation between gross revenue and the number of votes, so movies that earn more tend to receive more votes. The relationship between score and number of votes is positive but weaker, which means that movies with more votes tend to score somewhat higher, yet neither variable is a strong predictor of the other. Finally, a movie's popularity, as reflected by its gross, does not necessarily go hand in hand with its score, as indicated by the weak correlation coefficient of 0.242249 between the two.

While these measures provide insights into the performance of different genres, correlations and genre averages alone cannot explain why a movie succeeds. A fuller understanding of a movie's overall impact and quality requires considering additional factors, such as critical acclaim and audience reception.